Russian word sense induction by clustering averaged word embeddings
The paper reports our participation in the shared task on word sense
induction and disambiguation for the Russian language (RUSSE-2018). Our team
was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th
for the bts-rnc and active-dict datasets (containing mostly polysemous words)
among all 19 participants.
The method we employed was extremely naive. It involved representing contexts
of ambiguous words as averaged word embedding vectors, using off-the-shelf
pre-trained distributional models. Then, these vector representations were
clustered with mainstream clustering techniques, thus producing the groups
corresponding to the ambiguous word senses. As a side result, we show that word
embedding models trained on small but balanced corpora can be superior to those
trained on large but noisy data - not only in intrinsic evaluation, but also in
downstream tasks like word sense induction.
Comment: Proceedings of the 24th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue-2018)
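
For concreteness, here is a minimal sketch of this averaging-and-clustering pipeline. The model file name is a placeholder, and Affinity Propagation stands in for whichever mainstream clustering technique is used; this is an illustration of the approach, not the paper's exact setup.

    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.cluster import AffinityPropagation

    # Off-the-shelf pre-trained distributional model (path is a placeholder):
    model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    def context_vector(tokens, target):
        """Represent one usage context as the mean of its word vectors."""
        vectors = [model[tok] for tok in tokens if tok != target and tok in model]
        return np.mean(vectors, axis=0)

    def induce_senses(contexts, target):
        """Cluster averaged context vectors; each cluster is one sense group."""
        X = np.vstack([context_vector(toks, target) for toks in contexts])
        return AffinityPropagation(random_state=0).fit_predict(X)

Given a list of tokenised sentences containing the ambiguous word, induce_senses returns one cluster label per context, grouping occurrences by induced sense.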
Redefining Context Windows for Word Embedding Models: An Experimental Study
Distributional semantic models learn vector representations of words through
the contexts they occur in. Although the choice of context (which often takes
the form of a sliding window) has a direct influence on the resulting
embeddings, the exact role of this model component is still not fully
understood. This paper presents a systematic analysis of context windows based
on a set of four distinct hyper-parameters. We train continuous Skip-Gram
models on two English-language corpora for various combinations of these
hyper-parameters, and evaluate them on both lexical similarity and analogy
tasks. Notable experimental results are the positive impact of cross-sentential
contexts and the surprisingly good performance of right-context windows.
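
As an illustration, the sketch below shows one way the four window hyper-parameters can be operationalised when generating Skip-Gram training pairs. The parameter names and the dynamic-window weighting scheme are our illustrative choices, not necessarily the paper's exact implementation.

    import random

    def skipgram_pairs(sentences, max_size=5, dynamic=True,
                       direction="symmetric", cross_sentential=False):
        """Yield (target, context) pairs under four window hyper-parameters:
        maximum size, weighting (dynamic sampling), direction, and
        sentence-boundary handling."""
        # Cross-sentential windows ignore sentence boundaries entirely:
        units = [[t for s in sentences for t in s]] if cross_sentential \
            else sentences
        for tokens in units:
            for i, target in enumerate(tokens):
                # word2vec-style weighting: sample the effective window size
                size = random.randint(1, max_size) if dynamic else max_size
                if direction in ("left", "symmetric"):
                    for ctx in tokens[max(0, i - size):i]:
                        yield target, ctx
                if direction in ("right", "symmetric"):
                    for ctx in tokens[i + 1:i + 1 + size]:
                        yield target, ctx

Setting direction="right" together with cross_sentential=True corresponds to the right-context, cross-sentential configuration highlighted above.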
Redefining part-of-speech classes with distributional semantic models
This paper studies how word embeddings trained on the British National Corpus
interact with part-of-speech boundaries. Our work targets the Universal PoS tag
set, which is currently being used to annotate a range of languages. We
experiment with training classifiers to predict the PoS tags of words from
their embeddings. The results show that the information about PoS affiliation
contained in the distributional vectors allows us to discover groups of words
whose distributional patterns differ from those of other words of the same
part of speech.
These data often reveal hidden inconsistencies in the annotation process or
guidelines. At the same time, they support the notion of 'soft' or 'graded'
part-of-speech affiliation. Finally, we show that information about PoS is
distributed among dozens of vector components rather than being limited to one
or two features.
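
A minimal sketch of the classification experiment follows, assuming a pre-trained model over the British National Corpus and a (word, Universal PoS tag) lexicon. The file name and the toy lexicon are placeholders, and logistic regression stands in for whatever classifier is used.

    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.linear_model import LogisticRegression

    # Pre-trained embeddings (placeholder path) and a toy (word, UPOS) lexicon:
    model = KeyedVectors.load_word2vec_format("bnc.bin", binary=True)
    tagged = [("dog", "NOUN"), ("run", "VERB"), ("blue", "ADJ")]

    pairs = [(w, t) for w, t in tagged if w in model]
    words, tags = zip(*pairs)
    X = np.vstack([model[w] for w in words])

    # Train a classifier to recover the PoS tag from the vector alone:
    clf = LogisticRegression(max_iter=1000).fit(X, tags)

    # Words the classifier mis-tags behave distributionally unlike the rest
    # of their annotated class: candidates for 'graded' PoS membership.
    for word, gold, pred in zip(words, tags, clf.predict(X)):
        if gold != pred:
            print(word, gold, "->", pred)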
Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants
This paper deals with using word embedding models to trace the temporal
dynamics of semantic relations between pairs of words. The set-up is similar to
the well-known analogies task, but expanded with a time dimension. To this end,
we apply incremental updating of the models with new training texts, including
incremental vocabulary expansion, coupled with learned transformation matrices
that let us map between members of the relation. The proposed approach is
evaluated on the task of predicting insurgent armed groups based on
geographical locations. The gold standard data for the time span 1994–2010 is
extracted from the UCDP Armed Conflicts dataset. The results show that the
method is feasible and outperforms the baselines, but also that important work
still remains to be done.
Comment: To appear in EMNLP 2017 proceedings
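
Below is a sketch of the learned-transformation step, assuming word vectors are available for known (location, armed group) training pairs. The least-squares formulation is one standard way to learn such a mapping; the incremental model updating described above is omitted.

    import numpy as np

    def learn_projection(source_vecs, target_vecs):
        """Least-squares matrix T mapping source vectors onto target vectors."""
        X = np.vstack(source_vecs)   # (n_pairs, dim), e.g. location vectors
        Y = np.vstack(target_vecs)   # (n_pairs, dim), e.g. armed-group vectors
        T, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return T                     # (dim, dim)

    def predict_group(T, location_vec, model, topn=5):
        """Project a location vector into the target space and return its
        nearest neighbours as candidate armed groups for that location."""
        return model.similar_by_vector(location_vec @ T, topn=topn)

Here model is a gensim KeyedVectors instance over the same embedding space; the nearest neighbours of the projected vector serve as the predicted relation members.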
UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection
We apply contextualised word embeddings to lexical semantic change detection
in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking
words by the degree of their semantic drift over time. We analyse the
performance of two contextualising architectures (BERT and ELMo) and three
change detection algorithms. We find that the most effective algorithms rely on
the cosine similarity between averaged token embeddings and the pairwise
distances between token embeddings. They outperform strong baselines by a large
margin (in the post-evaluation phase, we have the best Subtask 2 submission for
SemEval-2020 Task 1), but interestingly, the choice of a particular algorithm
depends on the distribution of gold scores in the test set.
Comment: To appear in Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020)
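
For illustration, here are minimal implementations of the two measures named above, assuming each word's usages in the two time periods are given as (n_tokens x dim) arrays of contextualised token embeddings. Function names are our own shorthand.

    import numpy as np
    from scipy.spatial.distance import cosine, cdist

    def prt(old_usages, new_usages):
        """Cosine distance between the averaged token embeddings of the
        two periods (higher = more semantic change)."""
        return cosine(old_usages.mean(axis=0), new_usages.mean(axis=0))

    def apd(old_usages, new_usages):
        """Mean pairwise cosine distance between the two sets of token
        embeddings (higher = more semantic change)."""
        return cdist(old_usages, new_usages, metric="cosine").mean()

Ranking the target words by either score yields the Subtask 2 ranking by degree of semantic drift.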